Like Python, R is an object-oriented scripting language, which basically means that you can create objects and give them specific features. Compared with many other programming languages, R scripts tend to be short and the syntax is not complicated, which generally makes R easier to learn and use. However, like any language, R takes hours to learn the basics and a lifetime to master.
I will not start from R 101 (if you need a basic introduction to R, I suggest taking R Programming - R Language for Absolute Beginners from ND Udemy). Instead, I’ll start with something not terribly complicated but surprisingly useful.
Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning). A standard makes initial data cleaning easier because you don’t need to start from scratch and reinvent the wheel every time. In tidy data, each variable forms a column, each observation forms a row, and each value sits in its own cell.
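As a small illustration of the difference between a wide layout and a tidy one, here is a sketch in base R. The data frame `wide` and its numbers are invented for this example and are not part of the mining data used below.

```r
# Hypothetical toy data: one row per year, one column per state (wide, not tidy)
wide <- data.frame(
  year = c(2000, 2001),
  kentucky = c(10, 12),
  virginia = c(3, 4)
)

# Tidy version: every row is a single (year, state, value) observation
tidy <- data.frame(
  year  = rep(wide$year, times = 2),
  state = rep(c("kentucky", "virginia"), each = nrow(wide)),
  value = c(wide$kentucky, wide$virginia)
)
tidy
```

In the wide version the state is hidden in the column names; in the tidy version it is an ordinary variable that you can filter, group, or map to a plot aesthetic.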
A typical journey from raw data to results might involve many steps, such as filtering cases, transforming values, summarising data, and then running a statistical test. One operator can link all these steps together while keeping our code efficient and readable: the pipe. The pipe operator, written as %>%, takes the output of one function and passes it into the next function as its first argument. This allows us to link a sequence of analysis steps.
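To see what the pipe buys us, here is a minimal sketch comparing nested calls with a piped chain. The vector `x` is made up for the example, and I'm assuming the magrittr package is installed.

```r
library(magrittr)  # provides %>%

x <- c(4, 9, 16)

# Nested calls have to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same steps with the pipe read left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Both versions do exactly the same work; the piped one just lists the steps in the order they happen. Recent versions of R (4.1.0 and later) also ship a native pipe, `|>`, that covers most everyday uses.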
# Read in tabular data from the data folder. I don't want R to convert strings to factors, hence the second argument
mining = read.csv('data/CumeMine.csv', stringsAsFactors = FALSE)
# What does this data look like?
head(mining)
##                  X  kentucky tennessee virginia west.virginia X.1
## 1 geometryArea_km2 33,632.72 12,770.67 8,347.36     28,362.39  NA
## 2             1984    472.71     69.47     61.9        210.91  NA
## 3             1985    552.21     75.95    70.69        245.65  NA
## 4             1986    629.47     80.78    77.71        276.34  NA
## 5             1987    684.54      83.9    84.17        304.77  NA
## 6             1988     729.2     91.72    90.82        331.76  NA
##                                            X.2
## 1               does not include Campagna data
## 2 (nor does the cumulativeArea_byCounty sheet)
## 3
## 4
## 5
## 6
# Headers look like state names with some empty columns that have notes in them.
names(mining)
## [1] "X" "kentucky" "tennessee" "virginia"
## [5] "west.virginia" "X.1" "X.2"
# The first row looks like total areas by state.
# Let's extract that row and put it in a different dataset.
# I can pull this out by grabbing the first row and columns 2-5
tote.areas = mining[1,2:5]
# Check
str(tote.areas)
## 'data.frame': 1 obs. of 4 variables:
## $ kentucky : chr "33,632.72"
## $ tennessee : chr "12,770.67"
## $ virginia : chr "8,347.36"
## $ west.virginia: chr "28,362.39"
# Oops, those are characters, not numbers.
as.numeric(tote.areas)
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
# Still doesn't work: the thousands separators block the conversion, so I need to strip the commas first.
tote.area.no.comma = gsub(pattern=',',replacement='',x=tote.areas)
# Now I can convert it to a numeric.
tote.area.num = as.numeric(tote.area.no.comma)
# But we lost the state names, let's bind it back together.
total.areas = data.frame(state=names(mining)[2:5],area=tote.area.num,stringsAsFactors = F)
total.areas
## state area
## 1 kentucky 33632.72
## 2 tennessee 12770.67
## 3 virginia 8347.36
## 4 west.virginia 28362.39
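Since comma-formatted numbers show up in a lot of spreadsheets, it can be handy to wrap the two steps above in a small helper. This is only a sketch: the function name `parse_comma_number` is my own invention, and it assumes the comma is always a thousands separator (never a decimal mark).

```r
# Hypothetical helper: strip thousands separators, then convert to numeric
parse_comma_number <- function(x) {
  # fixed = TRUE treats "," as a literal character rather than a regular expression
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

# Two of the values from the total-area row above
areas <- c("33,632.72", "12,770.67")
parse_comma_number(areas)
```

If you already use the readr package, its `parse_number()` function handles this kind of cleanup too.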
# Now let's fix the original data frame using the tidyr, dplyr, and magrittr packages
library(dplyr) # A great library of data filtering tools
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr) # Easiest way to make data tidy
library(magrittr) # Adds pipe functionality
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:tidyr':
##
## extract
# Reread the data, this time skipping the first two lines (the header and the total-area row), and keep only the first five columns
mine.dat = read.csv('data/CumeMine.csv', stringsAsFactors = FALSE, skip = 2, header = FALSE)[1:5]
# Rename columns
names(mine.dat) = c('year','kentucky','tennessee','virginia','west.virginia')
# Convert to a tidy dataset by 'gathering' data
annual.mining = gather(mine.dat,key=state,value=mining,-year) %>%
  select(state,year,mining) %>%
  arrange(year)
# Tidy data!
head(annual.mining)
## state year mining
## 1 kentucky 1984 472.71
## 2 tennessee 1984 69.47
## 3 virginia 1984 61.90
## 4 west.virginia 1984 210.91
## 5 kentucky 1985 552.21
## 6 tennessee 1985 75.95
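A side note for readers on a recent tidyr: `gather()` still works, but since tidyr 1.0.0 it has been superseded by `pivot_longer()`. Here is a sketch of the same reshaping with the newer function. The data frame `toy` is a hypothetical two-year, two-state stand-in for `mine.dat`, using values from the output above.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for mine.dat (two years, two states)
toy <- data.frame(
  year = c(1984, 1985),
  kentucky = c(472.71, 552.21),
  tennessee = c(69.47, 75.95)
)

# Every column except year becomes a (state, mining) pair
long <- toy %>%
  pivot_longer(cols = -year, names_to = "state", values_to = "mining") %>%
  select(state, year, mining) %>%
  arrange(year)
long
```

The `names_to` and `values_to` arguments play the same role as `key` and `value` in `gather()`.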
With our tidy dataset in hand, plotting is easy. We can look at data in so many different ways.
How do mining rates differ over time and between states? Let’s use ggplot2 and find out!
library(ggplot2)
glines = ggplot(annual.mining,aes(x=year,y=mining,color=state)) + geom_line()
glines
What about cumulative mining extent?
gstack = ggplot(annual.mining,aes(x=year,y=mining,fill=state)) + geom_area(position='stack')
gstack
# I don't like that order, let's reorder the stacking position by using factors
annual.mining$States = factor(annual.mining$state,levels=c('tennessee','virginia','west.virginia','kentucky'))
# Pipes too!
gstack1 = arrange(annual.mining,States) %>% ggplot(aes(x=year,y=mining,fill=States)) + geom_area(position='stack')
gstack1
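Why does converting to a factor change the plot? ggplot2 orders discrete groups by factor level, and by default R sorts levels alphabetically. A minimal base R sketch, with a made-up vector `state`:

```r
state <- c("kentucky", "tennessee", "virginia")

# Default: levels are sorted alphabetically
f_default <- factor(state)
levels(f_default)

# Custom: levels follow the order we give explicitly
f_custom <- factor(state, levels = c("virginia", "kentucky", "tennessee"))
levels(f_custom)
```

The underlying values are unchanged; only the ordering that plotting and modelling functions see is different.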
What about the correlation between mining in West Virginia and Kentucky?
This time our data is not in the right structure to look at this correlation immediately, but that is easy to fix using the spread function from the tidyr package.
# First remove factor version of states
spread.mining = select(annual.mining,-States) %>%
  spread(key=state,value=mining)
# Back to the original structure.
head(spread.mining)
## year kentucky tennessee virginia west.virginia
## 1 1984 472.71 69.47 61.90 210.91
## 2 1985 552.21 75.95 70.69 245.65
## 3 1986 629.47 80.78 77.71 276.34
## 4 1987 684.54 83.90 84.17 304.77
## 5 1988 729.20 91.72 90.82 331.76
## 6 1989 768.28 97.62 95.73 361.66
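As with `gather()`, `spread()` has been superseded in tidyr 1.0.0 and later, this time by `pivot_wider()`. A sketch of the equivalent call, using a hypothetical data frame `long` built from a few of the values shown above:

```r
library(tidyr)

# Hypothetical stand-in for the tidy annual.mining data
long <- data.frame(
  state = c("kentucky", "west.virginia", "kentucky", "west.virginia"),
  year = c(1984, 1984, 1985, 1985),
  mining = c(472.71, 210.91, 552.21, 245.65)
)

# One column per state, one row per year
wide <- pivot_wider(long, names_from = state, values_from = mining)
wide
```

Here `names_from` and `values_from` correspond to `key` and `value` in `spread()`.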
ky.wv = ggplot(spread.mining,aes(x=kentucky,y=west.virginia)) + geom_point()
ky.wv
# We can even easily add a linear model to this data.
ky.wv1 = ggplot(spread.mining,aes(x=kentucky,y=west.virginia,label=year)) +
  geom_smooth(method='lm') +
  geom_point()
ky.wv1
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
# And print the model summary.
mine.model = lm(west.virginia~kentucky,data=spread.mining)
summary(mine.model)
##
## Call:
## lm(formula = west.virginia ~ kentucky, data = spread.mining)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.215 -12.174 -5.227 9.273 69.275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.244e+02 1.309e+01 -17.14 <2e-16 ***
## kentucky 7.743e-01 9.494e-03 81.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.55 on 31 degrees of freedom
## Multiple R-squared: 0.9954, Adjusted R-squared: 0.9952
## F-statistic: 6651 on 1 and 31 DF, p-value: < 2.2e-16
# Looks like for every additional unit of area mined in KY, about 0.77 units are mined in WV (this is the slope of the fit).
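If you want the slope and intercept as numbers rather than reading them off the printed summary, `coef()` will hand them over. The sketch below uses a small made-up data frame `toy` rather than the mining data, so the numbers are purely illustrative.

```r
# Hypothetical toy data that roughly follows y = 2 + 0.5 * x
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.6, 2.9, 3.6, 3.9, 4.6)
)

fit <- lm(y ~ x, data = toy)

# Named coefficients: "(Intercept)" and the predictor name
intercept <- coef(fit)[["(Intercept)"]]
slope <- coef(fit)[["x"]]

# A prediction combines both: intercept + slope * x
pred <- unname(predict(fit, newdata = data.frame(x = 10)))
```

Keep in mind that the slope describes the change in y per unit change in x. When the intercept is far from zero, as in the mining model above, the slope is not the same thing as the ratio of the two totals.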
We can easily add interactivity to these plots using the plotly library.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(ky.wv1)
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
Now you have enough knowledge to create your own plots. Let’s try to plot cumulative mining as a percentage of total area.
# First we need to join the total area data with the annual mining extent data.
p.mining = left_join(annual.mining,total.areas,by='state') %>%
  # Then add a column dividing mining extent by total area (this is a fraction; multiply by 100 for a true percentage)
  mutate(percent = mining/area)
head(p.mining)
## state year mining States area percent
## 1 kentucky 1984 472.71 kentucky 33632.72 0.014055063
## 2 tennessee 1984 69.47 tennessee 12770.67 0.005439809
## 3 virginia 1984 61.90 virginia 8347.36 0.007415518
## 4 west.virginia 1984 210.91 west.virginia 28362.39 0.007436256
## 5 kentucky 1985 552.21 kentucky 33632.72 0.016418833
## 6 tennessee 1985 75.95 tennessee 12770.67 0.005947221
#Now you make a plot here!
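If you get stuck, here is one possible starting point, and certainly not the only way to do it. The data frame `toy` is a hypothetical stand-in for `p.mining`, with values rounded from the output above; the object name `gpercent` and the axis labels are my own choices.

```r
library(ggplot2)

# Hypothetical stand-in for p.mining (two states, two years, rounded values)
toy <- data.frame(
  state = rep(c("kentucky", "tennessee"), times = 2),
  year = c(1984, 1984, 1985, 1985),
  percent = c(0.0141, 0.0054, 0.0164, 0.0059)
)

# Multiply the fraction by 100 so the axis reads as a percentage
gpercent <- ggplot(toy, aes(x = year, y = percent * 100, color = state)) +
  geom_line() +
  labs(x = "Year", y = "Cumulative mined area (% of state area)")
gpercent
```

Swapping `toy` for `p.mining` should give the full picture, and `ggplotly(gpercent)` would make it interactive in the same way as before.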